In this notebook, we build an image classification model that distinguishes among 133 dog breeds, using the dog breed dataset from Udacity (https://s3-us-west-1.amazonaws.com/udacity-aind/dog-project/dogImages.zip).
We use a pre-trained ResNet-50 model from the PyTorch torchvision library and add two fully connected layers on top of it. Following the transfer learning approach, we freeze the pre-trained convolutional layers of ResNet-50 and compute gradients only for the two fully connected layers.
We then perform hyperparameter tuning to optimize the model. After fine-tuning using the best hyperparameters, we add profiling and debugging configurations to the training and evaluation phases. The final step involves deploying the model by creating a custom inference script for predictions.
Finally, we test the model using test images of dogs to ensure it meets our expectations.
# TODO: Install any packages that you might need
# For instance, you will need the smdebug package
!pip install smdebug
Keyring is skipped due to an exception: 'keyring.backends'
Collecting smdebug
Downloading smdebug-1.0.12-py2.py3-none-any.whl (270 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 270.1/270.1 kB 2.6 MB/s eta 0:00:0000:01
Requirement already satisfied: protobuf>=3.6.0 in /opt/conda/lib/python3.7/site-packages (from smdebug) (3.20.3)
Requirement already satisfied: boto3>=1.10.32 in /opt/conda/lib/python3.7/site-packages (from smdebug) (1.26.24)
Requirement already satisfied: packaging in /opt/conda/lib/python3.7/site-packages (from smdebug) (20.1)
Requirement already satisfied: numpy>=1.16.0 in /opt/conda/lib/python3.7/site-packages (from smdebug) (1.21.6)
Collecting pyinstrument==3.4.2
Downloading pyinstrument-3.4.2-py2.py3-none-any.whl (83 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 83.3/83.3 kB 1.4 MB/s eta 0:00:00:00:01
Collecting pyinstrument-cext>=0.2.2
Downloading pyinstrument_cext-0.2.4-cp37-cp37m-manylinux2010_x86_64.whl (20 kB)
Requirement already satisfied: botocore<1.30.0,>=1.29.24 in /opt/conda/lib/python3.7/site-packages (from boto3>=1.10.32->smdebug) (1.29.24)
Requirement already satisfied: jmespath<2.0.0,>=0.7.1 in /opt/conda/lib/python3.7/site-packages (from boto3>=1.10.32->smdebug) (1.0.1)
Requirement already satisfied: s3transfer<0.7.0,>=0.6.0 in /opt/conda/lib/python3.7/site-packages (from boto3>=1.10.32->smdebug) (0.6.0)
Requirement already satisfied: pyparsing>=2.0.2 in /opt/conda/lib/python3.7/site-packages (from packaging->smdebug) (2.4.6)
Requirement already satisfied: six in /opt/conda/lib/python3.7/site-packages (from packaging->smdebug) (1.14.0)
Requirement already satisfied: python-dateutil<3.0.0,>=2.1 in /opt/conda/lib/python3.7/site-packages (from botocore<1.30.0,>=1.29.24->boto3>=1.10.32->smdebug) (2.8.2)
Requirement already satisfied: urllib3<1.27,>=1.25.4 in /opt/conda/lib/python3.7/site-packages (from botocore<1.30.0,>=1.29.24->boto3>=1.10.32->smdebug) (1.26.13)
Installing collected packages: pyinstrument-cext, pyinstrument, smdebug
Successfully installed pyinstrument-3.4.2 pyinstrument-cext-0.2.4 smdebug-1.0.12
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
[notice] A new release of pip available: 22.3.1 -> 23.0
[notice] To update, run: pip install --upgrade pip
# TODO: Import any packages that you might need
# For instance, you will need boto3 and sagemaker
import sagemaker
import boto3
from sagemaker.session import Session
from sagemaker import get_execution_role
# Initializing some useful variables
role = get_execution_role()
sagemaker_session = sagemaker.Session()
region = sagemaker_session.boto_region_name
bucket = sagemaker_session.default_bucket()
print(f"Region {region}")
print(f"Default s3 bucket : {bucket}")
Region us-west-2 Default s3 bucket : sagemaker-us-west-2-232496288858
For this project, we use the dogImages dataset linked above. It comprises images of 133 dog breeds, split into train, validation, and test folders, each containing examples of every breed. For instance, the path to a sample image in the test folder is ./dogImages/test/018.Beauceron/Beauceron_01284.jpg.
#TODO: Fetch and upload the data to AWS S3
# Command to download and unzip data
!wget https://s3-us-west-1.amazonaws.com/udacity-aind/dog-project/dogImages.zip
!unzip dogImages.zip > /dev/null
--2023-02-02 18:31:55-- https://s3-us-west-1.amazonaws.com/udacity-aind/dog-project/dogImages.zip Resolving s3-us-west-1.amazonaws.com (s3-us-west-1.amazonaws.com)... 52.219.194.0 Connecting to s3-us-west-1.amazonaws.com (s3-us-west-1.amazonaws.com)|52.219.194.0|:443... connected. HTTP request sent, awaiting response... 200 OK Length: 1132023110 (1.1G) [application/zip] Saving to: ‘dogImages.zip’ dogImages.zip 100%[===================>] 1.05G 71.3MB/s in 17s 2023-02-02 18:32:14 (64.6 MB/s) - ‘dogImages.zip’ saved [1132023110/1132023110]
prefix = "dogImagesDataset"
print("Starting to upload dogImages")
inputs = sagemaker_session.upload_data(path="dogImages", bucket=bucket, key_prefix=prefix)
print(f"Input path ( S3 file path ): {inputs}")
Starting to upload dogImages Input path ( S3 file path ): s3://sagemaker-us-west-2-232496288858/dogImagesDataset
For this image classification problem, we use a pre-trained ResNet-50 model with two fully connected layers added on top. ResNet-50 is a 50-layer-deep convolutional network trained on more than a million images across 1000 categories from the ImageNet database, which makes it a strong starting point for image recognition tasks. The optimizer is AdamW (for more information, see https://pytorch.org/docs/stable/generated/torch.optim.AdamW.html). The hyperparameter tuning job searches the following ranges: learning rate (default 0.001) over 0.0001 to 0.1; eps (default 1e-08) over 1e-09 to 1e-08; weight decay (default 0.01) over 0.001 to 0.1; and batch size, restricted to two values (64 and 128).
Note: You will need to use the hpo.py script to perform hyperparameter tuning.
# Import the required classes from the SageMaker tuner module
from sagemaker.tuner import (
    CategoricalParameter,
    ContinuousParameter,
    HyperparameterTuner,
)
# We use AdamW as the optimizer; it decouples the weight decay computation from the
# gradient update (a more correct/better approach than classic Adam's L2 penalty).
# So we tune weight_decay and eps in addition to the learning rate and batch size.
hyperparameter_ranges = {
    "lr": ContinuousParameter(0.0001, 0.1),
    "eps": ContinuousParameter(1e-9, 1e-8),
    "weight_decay": ContinuousParameter(1e-3, 1e-1),
    "batch_size": CategoricalParameter([64, 128]),
}
objective_metric_name = "average test loss"
objective_type = "Minimize"
metric_definitions = [{"Name": "average test loss", "Regex": "Test set: Average loss: ([0-9\\.]+)"}]
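The tuner extracts the objective metric by applying this regex to the training job's CloudWatch logs, so it must match the exact log format the training script emits. A quick sanity check against a sample line (taken from the training output later in this notebook):

```python
import re

# The regex the tuner applies to the training job's CloudWatch logs
pattern = r"Test set: Average loss: ([0-9\.]+)"

# A sample line in the format the training script emits
line = "Test set: Average loss: 4.1145, Accuracy: 216/836 (26%)"
loss = re.search(pattern, line).group(1)
print(loss)  # -> 4.1145
```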
from sagemaker.pytorch import PyTorch
estimator = PyTorch(
    entry_point="hpo.py",
    base_job_name="dog-breed-classification-hpo",
    role=role,
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    py_version="py36",
    framework_version="1.8",
)
tuner = HyperparameterTuner(
    estimator,
    objective_metric_name,
    hyperparameter_ranges,
    metric_definitions,
    max_jobs=4,
    max_parallel_jobs=1,
    objective_type=objective_type,
    early_stopping_type="Auto",
)
# TODO: Fit your HP Tuner
tuner.fit({"training": inputs }, wait=True)
No finished training job found associated with this estimator. Please make sure this estimator is only used for building workflow config
......!
# Get the best estimator from the tuning job
best_estimator = tuner.best_estimator()
# Get the hyperparameters of the best trained model
best_estimator.hyperparameters()
2023-02-02 21:25:00 Starting - Found matching resource for reuse 2023-02-02 21:25:00 Downloading - Downloading input data 2023-02-02 21:25:00 Training - Training image download completed. Training in progress. 2023-02-02 21:25:00 Uploading - Uploading generated training model 2023-02-02 21:25:00 Completed - Resource reused by training job: pytorch-training-230202-2107-003-62d9c381
{'_tuning_objective_metric': '"average test loss"',
'batch_size': '"64"',
'eps': '8.22789935548792e-09',
'lr': '0.00023934090400595828',
'sagemaker_container_log_level': '20',
'sagemaker_estimator_class_name': '"PyTorch"',
'sagemaker_estimator_module': '"sagemaker.pytorch.estimator"',
'sagemaker_job_name': '"dog-breed-classification-hpo-2023-02-02-21-07-24-934"',
'sagemaker_program': '"hpo.py"',
'sagemaker_region': '"us-west-2"',
'sagemaker_submit_directory': '"s3://sagemaker-us-west-2-232496288858/dog-breed-classification-hpo-2023-02-02-21-07-24-934/source/sourcedir.tar.gz"',
'weight_decay': '0.0013574056448429658'}
best_hyperparameters = {
    "batch_size": int(best_estimator.hyperparameters()["batch_size"].replace('"', "")),
    "eps": best_estimator.hyperparameters()["eps"],
    "lr": best_estimator.hyperparameters()["lr"],
    "weight_decay": best_estimator.hyperparameters()["weight_decay"],
}
print(f"Best hyperparameters from tuning: \n {best_hyperparameters}")
Best hyperparameters from tuning: 
 {'batch_size': 64, 'eps': '8.22789935548792e-09', 'lr': '0.00023934090400595828', 'weight_decay': '0.0013574056448429658'}
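Note that only batch_size was cast to an int above; eps, lr, and weight_decay are still strings. That is acceptable here because the training script re-parses its arguments, but the values can also be normalized up front. A small helper, assuming the JSON-style quoting SageMaker uses for hyperparameter values:

```python
import json

def clean_hyperparameters(raw, keys=("batch_size", "eps", "lr", "weight_decay")):
    # SageMaker returns some hyperparameters as JSON-encoded strings
    # (e.g. '"64"'), so strip the quoting and cast to numbers.
    cleaned = {}
    for key in keys:
        value = json.loads(raw[key]) if raw[key].startswith('"') else raw[key]
        cleaned[key] = int(value) if key == "batch_size" else float(value)
    return cleaned

raw = {'batch_size': '"64"', 'eps': '8.22789935548792e-09',
       'lr': '0.00023934090400595828', 'weight_decay': '0.0013574056448429658'}
print(clean_hyperparameters(raw))  # batch_size as int, the rest as floats
```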
TODO: Using the best hyperparameters, create and fine-tune a new model
Note: You will need to use the train_model.py script to perform model profiling and debugging.
# Setting up debugger and profiler rules and configs
from sagemaker.debugger import (
    Rule,
    rule_configs,
    ProfilerRule,
    DebuggerHookConfig,
    CollectionConfig,
    ProfilerConfig,
    FrameworkProfile,
)
rules = [
    Rule.sagemaker(rule_configs.vanishing_gradient()),
    Rule.sagemaker(rule_configs.overfit()),
    Rule.sagemaker(rule_configs.overtraining()),
    Rule.sagemaker(rule_configs.poor_weight_initialization()),
    ProfilerRule.sagemaker(rule_configs.ProfilerReport()),
]
profiler_config = ProfilerConfig(
    system_monitor_interval_millis=500,
    framework_profile_params=FrameworkProfile(num_steps=10),
)
collection_configs = [
    CollectionConfig(
        name="CrossEntropyLoss_output_0",
        parameters={
            "include_regex": "CrossEntropyLoss_output_0",
            "train.save_interval": "10",
            "eval.save_interval": "1",
        },
    )
]
debugger_config = DebuggerHookConfig(collection_configs=collection_configs)
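Inside train_model.py, the hook configured here has to be created and registered against the model; a minimal sketch using the smdebug PyTorch API (the function name here is illustrative, and the script's actual wiring may differ):

```python
def attach_debugger_hook(model, loss_fn):
    # smdebug is available inside the SageMaker training container
    # (and locally after the `pip install smdebug` above); import lazily.
    import smdebug.pytorch as smd

    # Build the hook from the JSON config SageMaker writes into the container
    # (/opt/ml/input/config/debughookconfig.json)
    hook = smd.Hook.create_from_json_file()
    hook.register_hook(model)    # save model tensors per the collections above
    hook.register_loss(loss_fn)  # save CrossEntropyLoss_output_0
    return hook
```

The training loop then calls `hook.set_mode(smd.modes.TRAIN)` before training batches and `hook.set_mode(smd.modes.EVAL)` before evaluation, so that the `train.save_interval` and `eval.save_interval` settings apply to the right phases.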
# Create and fit an estimator
estimator = PyTorch(
    entry_point="train_model.py",
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    role=role,
    # Using 1.6 because it is supported by the smdebug library, see
    # https://github.com/awslabs/sagemaker-debugger#debugger-supported-frameworks
    framework_version="1.6",
    py_version="py36",
    hyperparameters=best_hyperparameters,
    profiler_config=profiler_config,       # include the profiler hook
    debugger_hook_config=debugger_config,  # include the debugger hook
    rules=rules,
)
estimator.fit({"train": inputs}, wait=True)
2023-02-02 21:42:09 Starting - Starting the training job... 2023-02-02 21:42:35 Starting - Preparing the instances for trainingVanishingGradient: InProgress Overfit: InProgress Overtraining: InProgress PoorWeightInitialization: InProgress ProfilerReport: InProgress ...... 2023-02-02 21:43:37 Downloading - Downloading input data...... 2023-02-02 21:44:34 Training - Downloading the training image...... 2023-02-02 21:45:34 Training - Training image download completed. Training in progress...bash: cannot set terminal process group (-1): Inappropriate ioctl for device bash: no job control in this shell 2023-02-02 21:45:54,173 sagemaker-training-toolkit INFO Imported framework sagemaker_pytorch_container.training 2023-02-02 21:45:54,209 sagemaker_pytorch_container.training INFO Block until all host DNS lookups succeed. 2023-02-02 21:45:54,212 sagemaker_pytorch_container.training INFO Invoking user training script. 2023-02-02 21:45:54,512 sagemaker-training-toolkit INFO Invoking user script Training Env: { "additional_framework_parameters": {}, "channel_input_dirs": { "train": "/opt/ml/input/data/train" }, "current_host": "algo-1", "framework_module": "sagemaker_pytorch_container.training:main", "hosts": [ "algo-1" ], "hyperparameters": { "batch_size": 64, "eps": "8.22789935548792e-09", "lr": "0.00023934090400595828", "weight_decay": "0.0013574056448429658" }, "input_config_dir": "/opt/ml/input/config", "input_data_config": { "train": { "TrainingInputMode": "File", "S3DistributionType": "FullyReplicated", "RecordWrapperType": "None" } }, "input_dir": "/opt/ml/input", "is_master": true, "job_name": "pytorch-training-2023-02-02-21-42-08-475", "log_level": 20, "master_hostname": "algo-1", "model_dir": "/opt/ml/model", "module_dir": "s3://sagemaker-us-west-2-232496288858/pytorch-training-2023-02-02-21-42-08-475/source/sourcedir.tar.gz", "module_name": "train_model", "network_interface_name": "eth0", "num_cpus": 8, "num_gpus": 1, "output_data_dir": "/opt/ml/output/data", 
"output_dir": "/opt/ml/output", "output_intermediate_dir": "/opt/ml/output/intermediate", "resource_config": { "current_host": "algo-1", "current_instance_type": "ml.p3.2xlarge", "current_group_name": "homogeneousCluster", "hosts": [ "algo-1" ], "instance_groups": [ { "instance_group_name": "homogeneousCluster", "instance_type": "ml.p3.2xlarge", "hosts": [ "algo-1" ] } ], "network_interface_name": "eth0" }, "user_entry_point": "train_model.py" } Environment variables: SM_HOSTS=["algo-1"] SM_NETWORK_INTERFACE_NAME=eth0 SM_HPS={"batch_size":64,"eps":"8.22789935548792e-09","lr":"0.00023934090400595828","weight_decay":"0.0013574056448429658"} SM_USER_ENTRY_POINT=train_model.py SM_FRAMEWORK_PARAMS={} SM_RESOURCE_CONFIG={"current_group_name":"homogeneousCluster","current_host":"algo-1","current_instance_type":"ml.p3.2xlarge","hosts":["algo-1"],"instance_groups":[{"hosts":["algo-1"],"instance_group_name":"homogeneousCluster","instance_type":"ml.p3.2xlarge"}],"network_interface_name":"eth0"} SM_INPUT_DATA_CONFIG={"train":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"}} SM_OUTPUT_DATA_DIR=/opt/ml/output/data SM_CHANNELS=["train"] SM_CURRENT_HOST=algo-1 SM_MODULE_NAME=train_model SM_LOG_LEVEL=20 SM_FRAMEWORK_MODULE=sagemaker_pytorch_container.training:main SM_INPUT_DIR=/opt/ml/input SM_INPUT_CONFIG_DIR=/opt/ml/input/config SM_OUTPUT_DIR=/opt/ml/output SM_NUM_CPUS=8 SM_NUM_GPUS=1 SM_MODEL_DIR=/opt/ml/model SM_MODULE_DIR=s3://sagemaker-us-west-2-232496288858/pytorch-training-2023-02-02-21-42-08-475/source/sourcedir.tar.gz 
SM_TRAINING_ENV={"additional_framework_parameters":{},"channel_input_dirs":{"train":"/opt/ml/input/data/train"},"current_host":"algo-1","framework_module":"sagemaker_pytorch_container.training:main","hosts":["algo-1"],"hyperparameters":{"batch_size":64,"eps":"8.22789935548792e-09","lr":"0.00023934090400595828","weight_decay":"0.0013574056448429658"},"input_config_dir":"/opt/ml/input/config","input_data_config":{"train":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"}},"input_dir":"/opt/ml/input","is_master":true,"job_name":"pytorch-training-2023-02-02-21-42-08-475","log_level":20,"master_hostname":"algo-1","model_dir":"/opt/ml/model","module_dir":"s3://sagemaker-us-west-2-232496288858/pytorch-training-2023-02-02-21-42-08-475/source/sourcedir.tar.gz","module_name":"train_model","network_interface_name":"eth0","num_cpus":8,"num_gpus":1,"output_data_dir":"/opt/ml/output/data","output_dir":"/opt/ml/output","output_intermediate_dir":"/opt/ml/output/intermediate","resource_config":{"current_group_name":"homogeneousCluster","current_host":"algo-1","current_instance_type":"ml.p3.2xlarge","hosts":["algo-1"],"instance_groups":[{"hosts":["algo-1"],"instance_group_name":"homogeneousCluster","instance_type":"ml.p3.2xlarge"}],"network_interface_name":"eth0"},"user_entry_point":"train_model.py"} SM_USER_ARGS=["--batch_size","64","--eps","8.22789935548792e-09","--lr","0.00023934090400595828","--weight_decay","0.0013574056448429658"] SM_OUTPUT_INTERMEDIATE_DIR=/opt/ml/output/intermediate SM_CHANNEL_TRAIN=/opt/ml/input/data/train SM_HP_BATCH_SIZE=64 SM_HP_EPS=8.22789935548792e-09 SM_HP_LR=0.00023934090400595828 SM_HP_WEIGHT_DECAY=0.0013574056448429658 PYTHONPATH=/opt/ml/code:/opt/conda/bin:/opt/conda/lib/python36.zip:/opt/conda/lib/python3.6:/opt/conda/lib/python3.6/lib-dynload:/opt/conda/lib/python3.6/site-packages Invoking script with the following command: /opt/conda/bin/python3.6 train_model.py --batch_size 64 --eps 
8.22789935548792e-09 --lr 0.00023934090400595828 --weight_decay 0.0013574056448429658 [2023-02-02 21:45:55.407 algo-1:27 INFO utils.py:27] RULE_JOB_STOP_SIGNAL_FILENAME: None [2023-02-02 21:45:55.652 algo-1:27 INFO profiler_config_parser.py:102] Using config at /opt/ml/input/config/profilerconfig.json. Running on Device cuda:0 Hyperparameters : LR: 0.00023934090400595828, Eps: 8.22789935548792e-09, Weight-decay: 0.0013574056448429658, Batch Size: 64, Epoch: 2 Data Dir Path: /opt/ml/input/data/train Model Dir Path: /opt/ml/model Output Dir Path: /opt/ml/output/data [2023-02-02 21:46:00.572 algo-1:27 INFO json_config.py:91] Creating hook from json_config at /opt/ml/input/config/debughookconfig.json. [2023-02-02 21:46:00.574 algo-1:27 INFO hook.py:199] tensorboard_dir has not been set for the hook. SMDebug will not be exporting tensorboard summaries. [2023-02-02 21:46:00.575 algo-1:27 INFO hook.py:253] Saving to /opt/ml/output/tensors [2023-02-02 21:46:00.576 algo-1:27 INFO state_store.py:77] The checkpoint config file /opt/ml/input/config/checkpointconfig.json does not exist. [2023-02-02 21:46:00.606 algo-1:27 INFO hook.py:584] name:fc.0.weight count_params:524288 [2023-02-02 21:46:00.607 algo-1:27 INFO hook.py:584] name:fc.0.bias count_params:256 [2023-02-02 21:46:00.607 algo-1:27 INFO hook.py:584] name:fc.2.weight count_params:34048 [2023-02-02 21:46:00.608 algo-1:27 INFO hook.py:584] name:fc.2.bias count_params:133 [2023-02-02 21:46:00.608 algo-1:27 INFO hook.py:586] Total Trainable Params: 558725 Epoch 1 - Starting Training phase. Epoch: 1 - Training Model on Complete Training Dataset! [2023-02-02 21:46:01.747 algo-1:27 INFO hook.py:413] Monitoring the collections: gradients, CrossEntropyLoss_output_0, relu_input, losses [2023-02-02 21:46:01.749 algo-1:27 INFO python_profiler.py:182] Dumping cProfile stats to /opt/ml/output/profiler/framework/pytorch/cprofile/27-algo-1/prestepzero-*-start-1675374355652793.2_train-0-stepstart-1675374361749233.0/python_stats. 
[2023-02-02 21:46:01.768 algo-1:27 INFO hook.py:476] Hook is writing from the hook with pid: 27 [2023-02-02 21:46:12.376 algo-1:27 INFO python_profiler.py:182] Dumping cProfile stats to /opt/ml/output/profiler/framework/pytorch/cprofile/27-algo-1/train-0-stepstart-1675374361762277.0_train-0-forwardpassend-1675374372375976.8/python_stats. [2023-02-02 21:46:13.425 algo-1:27 INFO python_profiler.py:182] Dumping cProfile stats to /opt/ml/output/profiler/framework/pytorch/cprofile/27-algo-1/train-0-forwardpassend-1675374372385274.0_train-1-stepstart-1675374373424226.2/python_stats. [2023-02-02 21:46:17.588 algo-1:27 INFO python_profiler.py:182] Dumping cProfile stats to /opt/ml/output/profiler/framework/pytorch/cprofile/27-algo-1/train-1-stepstart-1675374373430919.2_train-1-forwardpassend-1675374377588009.8/python_stats. [2023-02-02 21:46:18.463 algo-1:27 INFO python_profiler.py:182] Dumping cProfile stats to /opt/ml/output/profiler/framework/pytorch/cprofile/27-algo-1/train-1-forwardpassend-1675374377591750.0_train-2-stepstart-1675374378462369.8/python_stats. [2023-02-02 21:46:22.450 algo-1:27 INFO python_profiler.py:182] Dumping cProfile stats to /opt/ml/output/profiler/framework/pytorch/cprofile/27-algo-1/train-2-stepstart-1675374378466878.5_train-2-forwardpassend-1675374382408504.0/python_stats. [2023-02-02 21:46:23.425 algo-1:27 INFO python_profiler.py:182] Dumping cProfile stats to /opt/ml/output/profiler/framework/pytorch/cprofile/27-algo-1/train-2-forwardpassend-1675374382451907.0_train-3-stepstart-1675374383424827.8/python_stats. [2023-02-02 21:46:27.395 algo-1:27 INFO python_profiler.py:182] Dumping cProfile stats to /opt/ml/output/profiler/framework/pytorch/cprofile/27-algo-1/train-3-stepstart-1675374383428854.8_train-3-forwardpassend-1675374387394790.0/python_stats. 
[2023-02-02 21:46:28.687 algo-1:27 INFO python_profiler.py:182] Dumping cProfile stats to /opt/ml/output/profiler/framework/pytorch/cprofile/27-algo-1/train-3-forwardpassend-1675374387397322.8_train-4-stepstart-1675374388686763.0/python_stats. [2023-02-02 21:46:32.160 algo-1:27 INFO python_profiler.py:182] Dumping cProfile stats to /opt/ml/output/profiler/framework/pytorch/cprofile/27-algo-1/train-4-stepstart-1675374388694483.5_train-4-forwardpassend-1675374392159810.5/python_stats. [2023-02-02 21:46:33.176 algo-1:27 INFO python_profiler.py:182] Dumping cProfile stats to /opt/ml/output/profiler/framework/pytorch/cprofile/27-algo-1/train-4-forwardpassend-1675374392161803.0_train-5-stepstart-1675374393175468.0/python_stats. [2023-02-02 21:46:36.537 algo-1:27 INFO python_profiler.py:182] Dumping cProfile stats to /opt/ml/output/profiler/framework/pytorch/cprofile/27-algo-1/train-5-stepstart-1675374393180429.5_train-5-forwardpassend-1675374396537505.8/python_stats. [2023-02-02 21:46:37.280 algo-1:27 INFO python_profiler.py:182] Dumping cProfile stats to /opt/ml/output/profiler/framework/pytorch/cprofile/27-algo-1/train-5-forwardpassend-1675374396539309.8_train-6-stepstart-1675374397279402.8/python_stats. [2023-02-02 21:46:40.664 algo-1:27 INFO python_profiler.py:182] Dumping cProfile stats to /opt/ml/output/profiler/framework/pytorch/cprofile/27-algo-1/train-6-stepstart-1675374397283719.5_train-6-forwardpassend-1675374400664573.8/python_stats. [2023-02-02 21:46:41.566 algo-1:27 INFO python_profiler.py:182] Dumping cProfile stats to /opt/ml/output/profiler/framework/pytorch/cprofile/27-algo-1/train-6-forwardpassend-1675374400666649.0_train-7-stepstart-1675374401565724.2/python_stats. [2023-02-02 21:46:44.883 algo-1:27 INFO python_profiler.py:182] Dumping cProfile stats to /opt/ml/output/profiler/framework/pytorch/cprofile/27-algo-1/train-7-stepstart-1675374401570113.2_train-7-forwardpassend-1675374404882778.0/python_stats. 
[2023-02-02 21:46:45.788 algo-1:27 INFO python_profiler.py:182] Dumping cProfile stats to /opt/ml/output/profiler/framework/pytorch/cprofile/27-algo-1/train-7-forwardpassend-1675374404884567.0_train-8-stepstart-1675374405788208.2/python_stats. [2023-02-02 21:46:49.176 algo-1:27 INFO python_profiler.py:182] Dumping cProfile stats to /opt/ml/output/profiler/framework/pytorch/cprofile/27-algo-1/train-8-stepstart-1675374405792509.0_train-8-forwardpassend-1675374409176302.5/python_stats. [2023-02-02 21:46:50.198 algo-1:27 INFO python_profiler.py:182] Dumping cProfile stats to /opt/ml/output/profiler/framework/pytorch/cprofile/27-algo-1/train-8-forwardpassend-1675374409178382.2_train-9-stepstart-1675374410198045.0/python_stats. [2023-02-02 21:46:53.590 algo-1:27 INFO python_profiler.py:182] Dumping cProfile stats to /opt/ml/output/profiler/framework/pytorch/cprofile/27-algo-1/train-9-stepstart-1675374410202298.8_train-9-forwardpassend-1675374413589852.5/python_stats. [2023-02-02 21:46:54.623 algo-1:27 INFO python_profiler.py:182] Dumping cProfile stats to /opt/ml/output/profiler/framework/pytorch/cprofile/27-algo-1/train-9-forwardpassend-1675374413591964.8_train-10-stepstart-1675374414622837.2/python_stats. Train set: Average loss: 4.5436, Accuracy: 1105/6680 (17%) Epoch 1 - Starting Testing phase. Epoch: 1 - Testing Model on Complete Testing Dataset! Test set: Average loss: 4.1145, Accuracy: 216/836 (26%) Epoch 2 - Starting Training phase. Epoch: 2 - Training Model on Complete Training Dataset! Train set: Average loss: 3.8656, Accuracy: 2074/6680 (31%) Epoch 2 - Starting Testing phase. Epoch: 2 - Testing Model on Complete Testing Dataset! 
Test set: Average loss: 3.6677, Accuracy: 287/836 (34%) Starting to Save the Model Completed Saving the Model INFO:__main__:Running on Device cuda:0 INFO:__main__:Hyperparameters : LR: 0.00023934090400595828, Eps: 8.22789935548792e-09, Weight-decay: 0.0013574056448429658, Batch Size: 64, Epoch: 2 INFO:__main__:Data Dir Path: /opt/ml/input/data/train INFO:__main__:Model Dir Path: /opt/ml/model INFO:__main__:Output Dir Path: /opt/ml/output/data Downloading: "https://download.pytorch.org/models/resnet50-19c8e357.pth" to /root/.cache/torch/hub/checkpoints/resnet50-19c8e357.pth #015 0%| | 0.00/97.8M [00:00<?, ?B/s]#015 4%|▍ | 3.98M/97.8M [00:00<00:02, 41.8MB/s]#015 9%|▉ | 9.01M/97.8M [00:00<00:02, 44.5MB/s]#015 15%|█▍ | 14.5M/97.8M [00:00<00:01, 47.7MB/s]#015 21%|██ | 20.3M/97.8M [00:00<00:01, 51.1MB/s]#015 27%|██▋ | 26.3M/97.8M [00:00<00:01, 54.0MB/s]#015 33%|███▎ | 32.4M/97.8M [00:00<00:01, 56.8MB/s]#015 40%|███▉ | 38.6M/97.8M [00:00<00:01, 59.0MB/s]#015 46%|████▌ | 44.8M/97.8M [00:00<00:00, 60.6MB/s]#015 52%|█████▏ | 51.0M/97.8M [00:00<00:00, 61.8MB/s]#015 59%|█████▊ | 57.3M/97.8M [00:01<00:00, 63.0MB/s]#015 65%|██████▍ | 63.2M/97.8M [00:01<00:00, 62.7MB/s]#015 71%|███████ | 69.4M/97.8M [00:01<00:00, 63.4MB/s]#015 77%|███████▋ | 75.6M/97.8M [00:01<00:00, 63.7MB/s]#015 84%|████████▎ | 81.8M/97.8M [00:01<00:00, 64.0MB/s]#015 90%|████████▉ | 87.9M/97.8M [00:01<00:00, 64.2MB/s]#015 96%|█████████▋| 94.2M/97.8M [00:01<00:00, 64.6MB/s]#015100%|██████████| 97.8M/97.8M [00:01<00:00, 62.0MB/s] INFO:__main__:Epoch 1 - Starting Training phase. INFO:__main__:Epoch: 1 - Training Model on Complete Training Dataset! INFO:__main__: Train set: Average loss: 4.5436, Accuracy: 1105/6680 (17%) INFO:__main__:Epoch 1 - Starting Testing phase. INFO:__main__:Epoch: 1 - Testing Model on Complete Testing Dataset! INFO:__main__: Test set: Average loss: 4.1145, Accuracy: 216/836 (26%) INFO:__main__:Epoch 2 - Starting Training phase. 
INFO:__main__:Epoch: 2 - Training Model on Complete Training Dataset! INFO:__main__: Train set: Average loss: 3.8656, Accuracy: 2074/6680 (31%) INFO:__main__:Epoch 2 - Starting Testing phase. INFO:__main__:Epoch: 2 - Testing Model on Complete Testing Dataset! 2023-02-02 21:50:25,716 sagemaker-training-toolkit INFO Reporting training SUCCESS INFO:__main__: Test set: Average loss: 3.6677, Accuracy: 287/836 (34%) INFO:__main__:Starting to Save the Model INFO:__main__:Completed Saving the Model VanishingGradient: Error Overfit: InProgress Overtraining: InProgress PoorWeightInitialization: InProgress 2023-02-02 21:51:36 Uploading - Uploading generated training modelVanishingGradient: Error Overfit: InProgress Overtraining: IssuesFound PoorWeightInitialization: Error 2023-02-02 21:52:04 Completed - Training job completed VanishingGradient: Error Overfit: NoIssuesFound Overtraining: IssuesFound PoorWeightInitialization: Error Training seconds: 493 Billable seconds: 493
# Fetch the job name, client, and training job description, to be used for plotting
job_name = estimator.latest_training_job.name
client = estimator.sagemaker_session.sagemaker_client
description = client.describe_training_job(TrainingJobName=job_name)
print(f"Jobname: {job_name}")
print(f"Client: {client}")
print(f"Description: {description}")
Jobname: pytorch-training-2023-02-02-21-42-08-475
Client: <botocore.client.SageMaker object at 0x7f7cff3789d0>
Description: {'TrainingJobName': 'pytorch-training-2023-02-02-21-42-08-475', 'TrainingJobArn': 'arn:aws:sagemaker:us-west-2:232496288858:training-job/pytorch-training-2023-02-02-21-42-08-475', 'ModelArtifacts': {'S3ModelArtifacts': 's3://sagemaker-us-west-2-232496288858/pytorch-training-2023-02-02-21-42-08-475/output/model.tar.gz'}, 'TrainingJobStatus': 'Completed', 'SecondaryStatus': 'Completed', 'HyperParameters': {'batch_size': '64', 'eps': '"8.22789935548792e-09"', 'lr': '"0.00023934090400595828"', 'sagemaker_container_log_level': '20', 'sagemaker_job_name': '"pytorch-training-2023-02-02-21-42-08-475"', 'sagemaker_program': '"train_model.py"', 'sagemaker_region': '"us-west-2"', 'sagemaker_submit_directory': '"s3://sagemaker-us-west-2-232496288858/pytorch-training-2023-02-02-21-42-08-475/source/sourcedir.tar.gz"', 'weight_decay': '"0.0013574056448429658"'}, 'AlgorithmSpecification': {'TrainingImage': '763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-training:1.6-gpu-py36', 'TrainingInputMode': 'File', 'EnableSageMakerMetricsTimeSeries': True}, 'RoleArn': 'arn:aws:iam::232496288858:role/service-role/AmazonSageMaker-ExecutionRole-20230202T190502', 'InputDataConfig': [{'ChannelName': 'train', 'DataSource': {'S3DataSource': {'S3DataType': 'S3Prefix', 'S3Uri': 's3://sagemaker-us-west-2-232496288858/dogImagesDataset', 'S3DataDistributionType': 'FullyReplicated'}}, 'CompressionType': 'None', 'RecordWrapperType': 'None'}], 'OutputDataConfig': {'KmsKeyId': '', 'S3OutputPath': 's3://sagemaker-us-west-2-232496288858/'}, 'ResourceConfig': {'InstanceType': 'ml.p3.2xlarge', 'InstanceCount': 1, 'VolumeSizeInGB': 30}, 'StoppingCondition': {'MaxRuntimeInSeconds': 86400}, 'CreationTime': datetime.datetime(2023, 2, 2, 21, 42, 9, 93000, tzinfo=tzlocal()), 'TrainingStartTime': datetime.datetime(2023, 2, 2, 21, 43, 37, tzinfo=tzlocal()), 'TrainingEndTime': datetime.datetime(2023, 2, 2, 21, 51, 50, 516000, tzinfo=tzlocal()), 'LastModifiedTime': datetime.datetime(2023, 2, 2, 21, 
52, 16, 683000, tzinfo=tzlocal()), 'SecondaryStatusTransitions': [{'Status': 'Starting', 'StartTime': datetime.datetime(2023, 2, 2, 21, 42, 9, 93000, tzinfo=tzlocal()), 'EndTime': datetime.datetime(2023, 2, 2, 21, 43, 37, tzinfo=tzlocal()), 'StatusMessage': 'Preparing the instances for training'}, {'Status': 'Downloading', 'StartTime': datetime.datetime(2023, 2, 2, 21, 43, 37, tzinfo=tzlocal()), 'EndTime': datetime.datetime(2023, 2, 2, 21, 44, 32, 551000, tzinfo=tzlocal()), 'StatusMessage': 'Downloading input data'}, {'Status': 'Training', 'StartTime': datetime.datetime(2023, 2, 2, 21, 44, 32, 551000, tzinfo=tzlocal()), 'EndTime': datetime.datetime(2023, 2, 2, 21, 51, 30, 100000, tzinfo=tzlocal()), 'StatusMessage': 'Training image download completed. Training in progress.'}, {'Status': 'Uploading', 'StartTime': datetime.datetime(2023, 2, 2, 21, 51, 30, 100000, tzinfo=tzlocal()), 'EndTime': datetime.datetime(2023, 2, 2, 21, 51, 50, 516000, tzinfo=tzlocal()), 'StatusMessage': 'Uploading generated training model'}, {'Status': 'Completed', 'StartTime': datetime.datetime(2023, 2, 2, 21, 51, 50, 516000, tzinfo=tzlocal()), 'EndTime': datetime.datetime(2023, 2, 2, 21, 51, 50, 516000, tzinfo=tzlocal()), 'StatusMessage': 'Training job completed'}], 'EnableNetworkIsolation': False, 'EnableInterContainerTrafficEncryption': False, 'EnableManagedSpotTraining': False, 'TrainingTimeInSeconds': 493, 'BillableTimeInSeconds': 493, 'DebugHookConfig': {'S3OutputPath': 's3://sagemaker-us-west-2-232496288858/', 'CollectionConfigurations': [{'CollectionName': 'relu_input', 'CollectionParameters': {'include_regex': '.*relu_input', 'save_interval': '500'}}, {'CollectionName': 'CrossEntropyLoss_output_0', 'CollectionParameters': {'eval.save_interval': '1', 'include_regex': 'CrossEntropyLoss_output_0', 'train.save_interval': '10'}}, {'CollectionName': 'gradients', 'CollectionParameters': {'save_interval': '500'}}]}, 'DebugRuleConfigurations': [{'RuleConfigurationName': 'VanishingGradient', 
'RuleEvaluatorImage': '895741380848.dkr.ecr.us-west-2.amazonaws.com/sagemaker-debugger-rules:latest', 'VolumeSizeInGB': 0, 'RuleParameters': {'rule_to_invoke': 'VanishingGradient'}}, {'RuleConfigurationName': 'Overfit', 'RuleEvaluatorImage': '895741380848.dkr.ecr.us-west-2.amazonaws.com/sagemaker-debugger-rules:latest', 'VolumeSizeInGB': 0, 'RuleParameters': {'rule_to_invoke': 'Overfit'}}, {'RuleConfigurationName': 'Overtraining', 'RuleEvaluatorImage': '895741380848.dkr.ecr.us-west-2.amazonaws.com/sagemaker-debugger-rules:latest', 'VolumeSizeInGB': 0, 'RuleParameters': {'rule_to_invoke': 'Overtraining'}}, {'RuleConfigurationName': 'PoorWeightInitialization', 'RuleEvaluatorImage': '895741380848.dkr.ecr.us-west-2.amazonaws.com/sagemaker-debugger-rules:latest', 'VolumeSizeInGB': 0, 'RuleParameters': {'rule_to_invoke': 'PoorWeightInitialization'}}], 'DebugRuleEvaluationStatuses': [{'RuleConfigurationName': 'VanishingGradient', 'RuleEvaluationJobArn': 'arn:aws:sagemaker:us-west-2:232496288858:processing-job/pytorch-training-2023-02-0-vanishinggradient-d9566825', 'RuleEvaluationStatus': 'Error', 'StatusDetails': 'InternalServerError: We encountered an internal error. 
Please try again.', 'LastModifiedTime': datetime.datetime(2023, 2, 2, 21, 52, 4, 277000, tzinfo=tzlocal())}, {'RuleConfigurationName': 'Overfit', 'RuleEvaluationJobArn': 'arn:aws:sagemaker:us-west-2:232496288858:processing-job/pytorch-training-2023-02-0-overfit-c5e75be0', 'RuleEvaluationStatus': 'NoIssuesFound', 'LastModifiedTime': datetime.datetime(2023, 2, 2, 21, 52, 4, 277000, tzinfo=tzlocal())}, {'RuleConfigurationName': 'Overtraining', 'RuleEvaluationJobArn': 'arn:aws:sagemaker:us-west-2:232496288858:processing-job/pytorch-training-2023-02-0-overtraining-d9fd7780', 'RuleEvaluationStatus': 'IssuesFound', 'StatusDetails': 'RuleEvaluationConditionMet: Evaluation of the rule Overtraining at step 116 resulted in the condition being met\n', 'LastModifiedTime': datetime.datetime(2023, 2, 2, 21, 52, 4, 277000, tzinfo=tzlocal())}, {'RuleConfigurationName': 'PoorWeightInitialization', 'RuleEvaluationJobArn': 'arn:aws:sagemaker:us-west-2:232496288858:processing-job/pytorch-training-2023-02-0-poorweightinitialization-5bed3620', 'RuleEvaluationStatus': 'Error', 'StatusDetails': 'InternalServerError: We encountered an internal error. 
Please try again.', 'LastModifiedTime': datetime.datetime(2023, 2, 2, 21, 52, 4, 277000, tzinfo=tzlocal())}], 'ProfilerConfig': {'S3OutputPath': 's3://sagemaker-us-west-2-232496288858/', 'ProfilingIntervalInMilliseconds': 500, 'ProfilingParameters': {'DataloaderProfilingConfig': '{"StartStep": 0, "NumSteps": 10, "MetricsRegex": ".*", }', 'DetailedProfilingConfig': '{"StartStep": 0, "NumSteps": 10, }', 'FileOpenFailThreshold': '50', 'HorovodProfilingConfig': '{"StartStep": 0, "NumSteps": 10, }', 'LocalPath': '/opt/ml/output/profiler', 'PythonProfilingConfig': '{"StartStep": 0, "NumSteps": 10, "ProfilerName": "cprofile", "cProfileTimer": "total_time", }', 'RotateFileCloseIntervalInSeconds': '60', 'RotateMaxFileSizeInBytes': '10485760', 'SMDataParallelProfilingConfig': '{"StartStep": 0, "NumSteps": 10, }'}, 'DisableProfiler': False}, 'ProfilerRuleConfigurations': [{'RuleConfigurationName': 'ProfilerReport', 'RuleEvaluatorImage': '895741380848.dkr.ecr.us-west-2.amazonaws.com/sagemaker-debugger-rules:latest', 'VolumeSizeInGB': 0, 'RuleParameters': {'rule_to_invoke': 'ProfilerReport'}}], 'ProfilerRuleEvaluationStatuses': [{'RuleConfigurationName': 'ProfilerReport', 'RuleEvaluationJobArn': 'arn:aws:sagemaker:us-west-2:232496288858:processing-job/pytorch-training-2023-02-0-profilerreport-f095e87b', 'RuleEvaluationStatus': 'NoIssuesFound', 'LastModifiedTime': datetime.datetime(2023, 2, 2, 21, 52, 16, 672000, tzinfo=tzlocal())}], 'ProfilingStatus': 'Enabled', 'ResponseMetadata': {'RequestId': '53eeb283-6c04-4416-8b65-c8c3f17946fa', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': '53eeb283-6c04-4416-8b65-c8c3f17946fa', 'content-type': 'application/x-amz-json-1.1', 'content-length': '7013', 'date': 'Thu, 02 Feb 2023 21:52:27 GMT'}, 'RetryAttempts': 0}}
TODO: Is there some anomalous behaviour in your debugging output? If so, what is the error and how will you fix it?
TODO: If not, suppose there was an error. What would that error look like and how would you have fixed it?
Answer: Yes. The Overtraining rule returned IssuesFound (its condition was met at step 116), which suggests the model was trained for longer than the data warrants; reducing the number of epochs, adding early stopping, or increasing regularization would address this. The VanishingGradient and PoorWeightInitialization rules both failed with an InternalServerError, which is a transient service-side failure rather than a training problem, so the fix is simply to re-run those rule evaluation jobs. The Overfit rule found no issues.
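Before answering, it can help to triage the rule results programmatically instead of reading the raw describe output. A minimal sketch, using hard-coded copies of the DebugRuleEvaluationStatuses entries shown above (in a live session they would come from `estimator.latest_training_job.rule_job_summary()`):

```python
# Hard-coded copies of the rule statuses from the describe output above.
rule_statuses = [
    {"RuleConfigurationName": "VanishingGradient", "RuleEvaluationStatus": "Error"},
    {"RuleConfigurationName": "Overfit", "RuleEvaluationStatus": "NoIssuesFound"},
    {"RuleConfigurationName": "Overtraining", "RuleEvaluationStatus": "IssuesFound"},
    {"RuleConfigurationName": "PoorWeightInitialization", "RuleEvaluationStatus": "Error"},
]

def triage(statuses):
    """Bucket rules by their evaluation status."""
    buckets = {}
    for rule in statuses:
        buckets.setdefault(rule["RuleEvaluationStatus"], []).append(
            rule["RuleConfigurationName"]
        )
    return buckets

buckets = triage(rule_statuses)
print(buckets.get("IssuesFound"))  # → ['Overtraining']
print(buckets.get("Error"))        # → ['VanishingGradient', 'PoorWeightInitialization']
```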
from smdebug.trials import create_trial
from smdebug.core.modes import ModeKeys
#creating a trial
trial = create_trial(estimator.latest_job_debugger_artifacts_path())
[2023-02-02 21:53:30.297 datascience-1-0-ml-t3-medium-fbbacbd136ea35c00e5ce9203df8:18 INFO utils.py:27] RULE_JOB_STOP_SIGNAL_FILENAME: None [2023-02-02 21:53:30.330 datascience-1-0-ml-t3-medium-fbbacbd136ea35c00e5ce9203df8:18 INFO s3_trial.py:42] Loading trial debug-output at path s3://sagemaker-us-west-2-232496288858/pytorch-training-2023-02-02-21-42-08-475/debug-output
trial.tensor_names() #all the tensor names
[2023-02-02 21:53:49.053 datascience-1-0-ml-t3-medium-fbbacbd136ea35c00e5ce9203df8:18 INFO trial.py:198] Training has ended, will refresh one final time in 1 sec. [2023-02-02 21:53:50.073 datascience-1-0-ml-t3-medium-fbbacbd136ea35c00e5ce9203df8:18 INFO trial.py:210] Loaded all steps
['CrossEntropyLoss_output_0', 'gradient/ResNet_fc.0.bias', 'gradient/ResNet_fc.0.weight', 'gradient/ResNet_fc.2.bias', 'gradient/ResNet_fc.2.weight', 'layer1.0.relu_input_0', 'layer1.0.relu_input_1', 'layer1.0.relu_input_2', 'layer1.1.relu_input_0', 'layer1.1.relu_input_1', 'layer1.1.relu_input_2', 'layer1.2.relu_input_0', 'layer1.2.relu_input_1', 'layer1.2.relu_input_2', 'layer2.0.relu_input_0', 'layer2.0.relu_input_1', 'layer2.0.relu_input_2', 'layer2.1.relu_input_0', 'layer2.1.relu_input_1', 'layer2.1.relu_input_2', 'layer2.2.relu_input_0', 'layer2.2.relu_input_1', 'layer2.2.relu_input_2', 'layer2.3.relu_input_0', 'layer2.3.relu_input_1', 'layer2.3.relu_input_2', 'layer3.0.relu_input_0', 'layer3.0.relu_input_1', 'layer3.0.relu_input_2', 'layer3.1.relu_input_0', 'layer3.1.relu_input_1', 'layer3.1.relu_input_2', 'layer3.2.relu_input_0', 'layer3.2.relu_input_1', 'layer3.2.relu_input_2', 'layer3.3.relu_input_0', 'layer3.3.relu_input_1', 'layer3.3.relu_input_2', 'layer3.4.relu_input_0', 'layer3.4.relu_input_1', 'layer3.4.relu_input_2', 'layer3.5.relu_input_0', 'layer3.5.relu_input_1', 'layer3.5.relu_input_2', 'layer4.0.relu_input_0', 'layer4.0.relu_input_1', 'layer4.0.relu_input_2', 'layer4.1.relu_input_0', 'layer4.1.relu_input_1', 'layer4.1.relu_input_2', 'layer4.2.relu_input_0', 'layer4.2.relu_input_1', 'layer4.2.relu_input_2', 'relu_input_0']
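These names map straight back onto the three collections configured in the debug hook (gradients, relu_input, CrossEntropyLoss_output_0). A minimal sketch grouping a handful of the names above by prefix:

```python
# A few tensor names copied from the list above, grouped back into the
# hook collections they were captured under.
names = [
    "CrossEntropyLoss_output_0",
    "gradient/ResNet_fc.0.bias",
    "gradient/ResNet_fc.0.weight",
    "layer1.0.relu_input_0",
    "relu_input_0",
]

gradients = [n for n in names if n.startswith("gradient/")]
relu_inputs = [n for n in names if "relu_input" in n]
losses = [n for n in names if n.startswith("CrossEntropyLoss")]

print(gradients)    # gradients exist only for the unfrozen fully connected layers
print(relu_inputs)
print(losses)
```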
len(trial.tensor("CrossEntropyLoss_output_0").steps(mode=ModeKeys.TRAIN))
21
len(trial.tensor("CrossEntropyLoss_output_0").steps(mode=ModeKeys.EVAL))
28
#Defining some utility functions to be used for plotting tensors
import matplotlib.pyplot as plt
from mpl_toolkits.axes_grid1 import host_subplot
# Utility function to get the step numbers and values of a tensor for a given mode
def get_data(trial, tname, mode):
    tensor = trial.tensor(tname)
    steps = tensor.steps(mode=mode)
    vals = []
    for s in steps:
        vals.append(tensor.value(s, mode=mode))
    return steps, vals
# Utility function for plotting a tensor's TRAIN and EVAL values on twin axes
def plot_tensor(trial, tensor_name):
    steps_train, vals_train = get_data(trial, tensor_name, mode=ModeKeys.TRAIN)
    print("loaded TRAIN data")
    steps_eval, vals_eval = get_data(trial, tensor_name, mode=ModeKeys.EVAL)
    print("loaded EVAL data")

    fig = plt.figure(figsize=(10, 7))
    host = host_subplot(111)
    par = host.twiny()

    host.set_xlabel("Steps (TRAIN)")
    par.set_xlabel("Steps (EVAL)")
    host.set_ylabel(tensor_name)

    (p1,) = host.plot(steps_train, vals_train, label=tensor_name)
    print("Completed TRAIN plot")
    (p2,) = par.plot(steps_eval, vals_eval, label="val_" + tensor_name)
    print("Completed EVAL plot")

    leg = plt.legend()
    host.xaxis.get_label().set_color(p1.get_color())
    leg.texts[0].set_color(p1.get_color())
    par.xaxis.get_label().set_color(p2.get_color())
    leg.texts[1].set_color(p2.get_color())

    plt.ylabel(tensor_name)
    plt.show()
#plotting the tensor
plot_tensor(trial, "CrossEntropyLoss_output_0");
loaded TRAIN data loaded EVAL data Completed TRAIN plot Completed EVAL plot
# TODO: Display the profiler output
rule_output_path = estimator.output_path + estimator.latest_training_job.job_name + "/rule-output"
print(f"Profiler report location: {rule_output_path}")
Profiler report location: s3://sagemaker-us-west-2-232496288858/pytorch-training-2023-02-02-21-42-08-475/rule-output
! aws s3 ls {rule_output_path} --recursive
2023-02-02 21:52:02 380974 pytorch-training-2023-02-02-21-42-08-475/rule-output/ProfilerReport/profiler-output/profiler-report.html 2023-02-02 21:52:02 230126 pytorch-training-2023-02-02-21-42-08-475/rule-output/ProfilerReport/profiler-output/profiler-report.ipynb 2023-02-02 21:51:58 191 pytorch-training-2023-02-02-21-42-08-475/rule-output/ProfilerReport/profiler-output/profiler-reports/BatchSize.json 2023-02-02 21:51:58 13612 pytorch-training-2023-02-02-21-42-08-475/rule-output/ProfilerReport/profiler-output/profiler-reports/CPUBottleneck.json 2023-02-02 21:51:58 126 pytorch-training-2023-02-02-21-42-08-475/rule-output/ProfilerReport/profiler-output/profiler-reports/Dataloader.json 2023-02-02 21:51:58 129 pytorch-training-2023-02-02-21-42-08-475/rule-output/ProfilerReport/profiler-output/profiler-reports/GPUMemoryIncrease.json 2023-02-02 21:51:58 1008 pytorch-training-2023-02-02-21-42-08-475/rule-output/ProfilerReport/profiler-output/profiler-reports/IOBottleneck.json 2023-02-02 21:51:58 309 pytorch-training-2023-02-02-21-42-08-475/rule-output/ProfilerReport/profiler-output/profiler-reports/LoadBalancing.json 2023-02-02 21:51:58 153 pytorch-training-2023-02-02-21-42-08-475/rule-output/ProfilerReport/profiler-output/profiler-reports/LowGPUUtilization.json 2023-02-02 21:51:58 232 pytorch-training-2023-02-02-21-42-08-475/rule-output/ProfilerReport/profiler-output/profiler-reports/MaxInitializationTime.json 2023-02-02 21:51:58 1057 pytorch-training-2023-02-02-21-42-08-475/rule-output/ProfilerReport/profiler-output/profiler-reports/OverallFrameworkMetrics.json 2023-02-02 21:51:58 610 pytorch-training-2023-02-02-21-42-08-475/rule-output/ProfilerReport/profiler-output/profiler-reports/OverallSystemUsage.json 2023-02-02 21:51:58 2462 pytorch-training-2023-02-02-21-42-08-475/rule-output/ProfilerReport/profiler-output/profiler-reports/StepOutlier.json
! aws s3 cp {rule_output_path} ./ --recursive
download: s3://sagemaker-us-west-2-232496288858/pytorch-training-2023-02-02-21-42-08-475/rule-output/ProfilerReport/profiler-output/profiler-reports/BatchSize.json to ProfilerReport/profiler-output/profiler-reports/BatchSize.json download: s3://sagemaker-us-west-2-232496288858/pytorch-training-2023-02-02-21-42-08-475/rule-output/ProfilerReport/profiler-output/profiler-reports/CPUBottleneck.json to ProfilerReport/profiler-output/profiler-reports/CPUBottleneck.json download: s3://sagemaker-us-west-2-232496288858/pytorch-training-2023-02-02-21-42-08-475/rule-output/ProfilerReport/profiler-output/profiler-report.ipynb to ProfilerReport/profiler-output/profiler-report.ipynb download: s3://sagemaker-us-west-2-232496288858/pytorch-training-2023-02-02-21-42-08-475/rule-output/ProfilerReport/profiler-output/profiler-reports/Dataloader.json to ProfilerReport/profiler-output/profiler-reports/Dataloader.json download: s3://sagemaker-us-west-2-232496288858/pytorch-training-2023-02-02-21-42-08-475/rule-output/ProfilerReport/profiler-output/profiler-reports/LoadBalancing.json to ProfilerReport/profiler-output/profiler-reports/LoadBalancing.json download: s3://sagemaker-us-west-2-232496288858/pytorch-training-2023-02-02-21-42-08-475/rule-output/ProfilerReport/profiler-output/profiler-reports/OverallFrameworkMetrics.json to ProfilerReport/profiler-output/profiler-reports/OverallFrameworkMetrics.json download: s3://sagemaker-us-west-2-232496288858/pytorch-training-2023-02-02-21-42-08-475/rule-output/ProfilerReport/profiler-output/profiler-reports/GPUMemoryIncrease.json to ProfilerReport/profiler-output/profiler-reports/GPUMemoryIncrease.json download: s3://sagemaker-us-west-2-232496288858/pytorch-training-2023-02-02-21-42-08-475/rule-output/ProfilerReport/profiler-output/profiler-reports/OverallSystemUsage.json to ProfilerReport/profiler-output/profiler-reports/OverallSystemUsage.json download: 
s3://sagemaker-us-west-2-232496288858/pytorch-training-2023-02-02-21-42-08-475/rule-output/ProfilerReport/profiler-output/profiler-reports/LowGPUUtilization.json to ProfilerReport/profiler-output/profiler-reports/LowGPUUtilization.json download: s3://sagemaker-us-west-2-232496288858/pytorch-training-2023-02-02-21-42-08-475/rule-output/ProfilerReport/profiler-output/profiler-reports/MaxInitializationTime.json to ProfilerReport/profiler-output/profiler-reports/MaxInitializationTime.json download: s3://sagemaker-us-west-2-232496288858/pytorch-training-2023-02-02-21-42-08-475/rule-output/ProfilerReport/profiler-output/profiler-reports/StepOutlier.json to ProfilerReport/profiler-output/profiler-reports/StepOutlier.json download: s3://sagemaker-us-west-2-232496288858/pytorch-training-2023-02-02-21-42-08-475/rule-output/ProfilerReport/profiler-output/profiler-report.html to ProfilerReport/profiler-output/profiler-report.html download: s3://sagemaker-us-west-2-232496288858/pytorch-training-2023-02-02-21-42-08-475/rule-output/ProfilerReport/profiler-output/profiler-reports/IOBottleneck.json to ProfilerReport/profiler-output/profiler-reports/IOBottleneck.json
import os
import IPython
# get the autogenerated folder name of profiler report
profiler_report_name = [
    rule["RuleConfigurationName"]
    for rule in estimator.latest_training_job.rule_job_summary()
    if "Profiler" in rule["RuleConfigurationName"]
][0]
# Zipping the ProfilerReport in order to export and upload it later for submission
import shutil
shutil.make_archive("./profiler_report", "zip", "ProfilerReport")
'/root/CD0387-deep-learning-topics-within-computer-vision-nlp-project-starter/profiler_report.zip'
# TODO: Deploy your model to an endpoint
predictor = estimator.deploy(initial_instance_count=1, instance_type="ml.p3.2xlarge")
--------!
from sagemaker.pytorch import PyTorchModel
from sagemaker.predictor import Predictor
#Below is the s3 location of our saved model that was trained by the training job using the best hyperparameters
model_data_artifacts = "s3://sagemaker-us-west-2-232496288858/pytorch-training-230202-2107-002-dcffdac6/output/model.tar.gz"
# Define the serializer and deserializer that the predictor will use by default
jpeg_serializer = sagemaker.serializers.IdentitySerializer("image/jpeg")
json_deserializer = sagemaker.deserializers.JSONDeserializer()
# To override the default serializer and deserializer, we define a class inheriting from Predictor and pass it as the predictor_cls parameter to PyTorchModel
class ImgPredictor(Predictor):
    def __init__(self, endpoint_name, sagemaker_session):
        super(ImgPredictor, self).__init__(
            endpoint_name,
            sagemaker_session=sagemaker_session,
            serializer=jpeg_serializer,
            deserializer=json_deserializer,
        )
pytorch_model = PyTorchModel(
    model_data=model_data_artifacts,
    role=role,
    entry_point="endpoint_inference.py",
    py_version="py36",
    framework_version="1.6",
    predictor_cls=ImgPredictor,
)
predictor = pytorch_model.deploy( initial_instance_count = 1, instance_type = "ml.t2.medium") #Using ml.t2.medium to save costs
-------------!
#Testing the deployed endpoint using some test images
#Solution 1: Using the Predictor object directly.
from PIL import Image
import io
import os
import numpy as np
test_dir = "./dogImages/test/129.Tibetan_mastiff/"
test_images = ["Tibetan_mastiff_08158.jpg", "Tibetan_mastiff_08139.jpg", "Tibetan_mastiff_08138.jpg"]
test_images_expected_output = [129, 5, 21 ]
for index in range(len(test_images)):
    test_img = test_images[index]
    expected_breed_category = test_images_expected_output[index]
    print(f"Test image no: {index+1}")
    test_file_path = os.path.join(test_dir, test_img)
    with open(test_file_path, "rb") as f:
        payload = f.read()
    print("Below is the image that we will be testing:")
    display(Image.open(io.BytesIO(payload)))
    print(f"Expected dog breed category no : {expected_breed_category}")
    response = predictor.predict(payload, initial_args={"ContentType": "image/jpeg"})
    print(f"Response: {response}")
    predicted_dog_breed = np.argmax(response, 1) + 1  # +1 because model outputs are zero-indexed while breed categories start at 1
    print(f"Response/Inference for the above image is : {predicted_dog_breed}")
    print("----------------------------------------------------------------------")
Test image no: 1 Below is the image that we will be testing:
Expected dog breed category no : 129 Response: [[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.662408173084259, 0.05551016703248024, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.21218474209308624, 0.0, 0.0, 0.0, 0.0, 0.9146386384963989, 0.5780836939811707, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.1485905647277832, 0.0, 0.01837325096130371, 0.0, 0.0, 0.0, 0.3390808701515198, 0.0, 0.0, 0.0, 0.8508307933807373, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.6743180751800537, 0.0, 0.0, 1.4576091766357422, 0.9207280278205872, 0.0, 0.0, 0.0, 0.9517050981521606, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.7739618420600891, 0.5983347296714783, 0.0, 0.0, 0.2721315026283264, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.917948842048645, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0662590265274048, 0.0, 0.0, 0.0, 0.0, 0.5779383778572083, 0.5139908790588379, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.4284725189208984, 0.0, 0.0, 0.0, 0.0]] Response/Inference for the above image is : [129] ---------------------------------------------------------------------- Test image no: 2 Below is the image that we will be testing:
Expected dog breed category no : 5 Response: [[0.0, 0.08648087084293365, 0.07022416591644287, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.5109014511108398, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.011233434081077576, 0.0, 0.0, 0.0, 0.0, 1.5123292207717896, 0.7675759792327881, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.2232207953929901, 0.0, 0.40191733837127686, 0.0, 0.0, 0.0, 0.8265922665596008, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.08851224929094315, 0.0, 1.5181233882904053, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 3.390899181365967, 0.0, 0.0, 1.4006421566009521, 1.336551547050476, 0.0, 0.0, 0.0, 0.6995121836662292, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.9968060255050659, 1.3887072801589966, 0.0, 0.0, 0.3057568073272705, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.5242455005645752, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0574803352355957, 0.0, 0.0, 0.0, 0.0, 0.48372882604599, 0.5821180939674377, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.226931571960449, 0.0, 0.0, 0.0, 0.0]] Response/Inference for the above image is : [76] ---------------------------------------------------------------------- Test image no: 3 Below is the image that we will be testing:
Expected dog breed category no : 21 Response: [[0.0, 0.47988080978393555, 0.0534096360206604, 0.0, 2.029130697250366, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.8579487204551697, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.42556288838386536, 0.5418018102645874, 0.0, 0.0, 0.0, 0.19376616179943085, 0.4305182993412018, 0.5111054182052612, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.4361092150211334, 0.0, 0.0, 0.0, 0.0, 0.0, 0.06514547765254974, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.21120649576187134, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.32620373368263245, 0.904705286026001, 0.0, 0.0, 0.0, 0.09315355867147446, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.6515793204307556, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.7480322122573853, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.647281527519226, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.4045291244983673, 0.0, 0.0, 0.0, 0.0, 3.648926258087158, 0.0, 0.0, 0.0, 0.0]] Response/Inference for the above image is : [129] ----------------------------------------------------------------------
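The argmax-plus-one step used in the loop above can be sketched in isolation with a toy response (4 categories instead of 133):

```python
import numpy as np

# The endpoint returns one activation per breed, zero-indexed; the dataset
# numbers breeds from 1, so the predicted category is argmax + 1.
toy_response = [[0.1, 0.0, 2.4, 0.3]]  # pretend there were only 4 breeds
predicted = np.argmax(toy_response, 1) + 1
print(predicted)  # → [3]
```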
print(predictor.endpoint_name)
endpoint_name = predictor.endpoint_name
pytorch-inference-2023-02-02-22-15-13-553
# Solution 2: Using boto3
# Using the runtime boto3 client to test the deployed model's endpoint
import os
import io
import boto3
import json
import base64
import numpy as np
from PIL import Image
# The name of the deployed endpoint we will invoke
ENDPOINT_NAME = endpoint_name
# We use AWS's lightweight SageMaker runtime client to invoke the endpoint.
runtime = boto3.client('runtime.sagemaker')
test_dir = "./dogImages/test/129.Tibetan_mastiff/"
test_images = ["Tibetan_mastiff_08158.jpg", "Tibetan_mastiff_08139.jpg", "Tibetan_mastiff_08138.jpg"]
test_images_expected_output = [129, 5, 21 ]
for index in range(len(test_images)):
    test_img = test_images[index]
    expected_breed_category = test_images_expected_output[index]
    print(f"Test image no: {index+1}")
    test_file_path = os.path.join(test_dir, test_img)
    with open(test_file_path, "rb") as f:
        payload = f.read()
    print("Below is the image that we will be testing:")
    display(Image.open(io.BytesIO(payload)))
    print(f"Expected dog breed category no : {expected_breed_category}")
    response = runtime.invoke_endpoint(
        EndpointName=ENDPOINT_NAME,
        ContentType="image/jpeg",
        Body=payload,
    )
    response_body = np.asarray(json.loads(response["Body"].read().decode("utf-8")))
    print(f"Response: {response_body}")
    predicted_dog_breed = np.argmax(response_body, 1) + 1  # +1 because model outputs are zero-indexed while breed categories start at 1
    print(f"Response/Inference for the above image is : {predicted_dog_breed}")
Test image no: 1 Below is the image that we will be testing:
Expected dog breed category no : 129 Response: [[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.66240817 0.05551017 0. 0. 0. 0. 0. 0. 0. 0. 0.21218474 0. 0. 0. 0. 0.91463864 0.57808369 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.14859056 0. 0.01837325 0. 0. 0. 0.33908087 0. 0. 0. 0.85083079 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.67431808 0. 0. 1.45760918 0.92072803 0. 0. 0. 0.9517051 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.77396184 0.59833473 0. 0. 0.2721315 0. 0. 0. 0. 0. 0. 1.91794884 0. 0. 0. 0. 0. 1.06625903 0. 0. 0. 0. 0.57793838 0.51399088 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 2.42847252 0. 0. 0. 0. ]] Response/Inference for the above image is : [129] Test image no: 2 Below is the image that we will be testing:
Expected dog breed category no : 5 Response: [[0. 0.08648087 0.07022417 0. 0. 0. 0. 0. 0. 0. 0. 0.51090145 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.01123343 0. 0. 0. 0. 1.51232922 0.76757598 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.2232208 0. 0.40191734 0. 0. 0. 0.82659227 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.08851225 0. 1.51812339 0. 0. 0. 0. 0. 0. 0. 3.39089918 0. 0. 1.40064216 1.33655155 0. 0. 0. 0.69951218 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.99680603 1.38870728 0. 0. 0.30575681 0. 0. 0. 0. 0. 0. 1.5242455 0. 0. 0. 0. 0. 2.05748034 0. 0. 0. 0. 0.48372883 0.58211809 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 2.22693157 0. 0. 0. 0. ]] Response/Inference for the above image is : [76] Test image no: 3 Below is the image that we will be testing:
Expected dog breed category no : 21 Response: [[0. 0.47988081 0.05340964 0. 2.0291307 0. 0. 0. 0. 0. 0. 0.85794872 0. 0. 0. 0. 0. 0. 0. 0. 0.42556289 0.54180181 0. 0. 0. 0.19376616 0.4305183 0.51110542 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.43610922 0. 0. 0. 0. 0. 0.06514548 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.2112065 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.32620373 0.90470529 0. 0. 0. 0.09315356 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.65157932 0. 0. 0. 0. 0. 0. 1.74803221 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.64728153 0. 0. 0. 0. 0. 0. 0.40452912 0. 0. 0. 0. 3.64892626 0. 0. 0. 0. ]] Response/Inference for the above image is : [129]
# TODO: Remember to shutdown/delete your endpoint once your work is done
predictor.delete_endpoint()